

# NetSpeed ORION: A New Approach to Designing On-chip Interconnects

August 26<sup>th</sup>, 2013

#### © Copyright 2013 NetSpeed Systems

## INTERCONNECTS BECOMING INCREASINGLY IMPORTANT

- Growing number of IP cores
  - Average SoCs today have 100+ IPs
  - Mixing and matching of IP cores

- Increasing design complexity
  - Higher performance
  - Complex traffic, QoS and deadlock
  - Power Management

- Increasing use of off-the-shelf IPs
  - Most SoC components are available as standard IPs
  - Designing an SoC means designing the interconnect
  - Interconnect affects schedule 50% of the time



[Figure: chart plotted across process nodes 90 nm, 65 nm, 45 nm, 28 nm and 22 nm. Source: International Business Strategies, 2012]







## **NOC DESIGN SOLUTIONS**



• Key NoC design elements and challenges are:



Existing commercial NoC solutions: Focused on hardware



Visio-like GUI and NoC design

#### Key Design Challenges Not Addressed

1. What is the right NoC topology?
2. How many NoC layers and channels are needed to satisfy bandwidth?
3. Is the NoC deadlock free?
4. Are end-to-end QoS requirements satisfied?
5. Is the NoC efficient? Are NoC channels and buffers optimized?



#### ARCHITECTURE AUTOMATION

- […] NoC channels and buffers
- […] traffic to NoC channels using high level SoC traffic specifications
- Fine grained control to architect after optimizations

#### STATE-OF-THE-ART ALGORITHMS

- End-to-end QoS
- Network and Protocol level deadlock free
- Placement aware SoC IP core to NoC mapping



- Physically aware NoC design
- Adaptable and scalable architecture
- Optimal latency and power consumption
- Correct by construction and faster time to market

## **DESIGN FLOW:** SPECIFY, (OPTIMIZE) & GENERATE





## **Specify** $\rightarrow$ Optimize $\rightarrow$ Generate



- Reference 2D grid is used to define the IP cores in SoC
  - Enables our solution to create physically aware interconnect
  - Automated and visual design and analysis of the interconnect
- Add SoC IPs
  - Rough SoC floorplan description
  - Position, shape, size, ports & protocols
- Add traffic between components to create a baseline NoC design
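The Specify step can be pictured as filling in a small data model: a reference grid, IP cores with position, size, ports and protocols, and traffic flows between them. The sketch below is only an illustration of that idea. Every class, field and method name is hypothetical (none of it is the NocStudio interface), and "AXI4" is just an example protocol label.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class IPCore:
    """One IP block on the reference grid (all field names are hypothetical)."""
    name: str
    x: int             # column of the block's first grid cell
    y: int             # row of the block's first grid cell
    width: int = 1     # size in grid cells
    height: int = 1
    ports: tuple = ()  # (port name, protocol) pairs, e.g. ("m", "AXI4")


@dataclass(frozen=True)
class Flow:
    src: str
    dst: str
    bandwidth_gbps: float


@dataclass
class SoCSpec:
    cols: int
    rows: int
    cores: dict = field(default_factory=dict)
    flows: list = field(default_factory=list)

    def add_core(self, core):
        """Place a core, rejecting duplicates, out-of-grid and overlapping blocks."""
        if core.name in self.cores:
            raise ValueError(f"{core.name} is already placed")
        inside = (core.x >= 0 and core.y >= 0
                  and core.x + core.width <= self.cols
                  and core.y + core.height <= self.rows)
        if not inside:
            raise ValueError(f"{core.name} does not fit on the grid")
        for other in self.cores.values():
            overlap = (core.x < other.x + other.width
                       and other.x < core.x + core.width
                       and core.y < other.y + other.height
                       and other.y < core.y + core.height)
            if overlap:
                raise ValueError(f"{core.name} overlaps {other.name}")
        self.cores[core.name] = core

    def add_flow(self, src, dst, bandwidth_gbps):
        """Record a traffic requirement between two already placed cores."""
        for end in (src, dst):
            if end not in self.cores:
                raise KeyError(f"unknown IP core: {end}")
        self.flows.append(Flow(src, dst, bandwidth_gbps))


# Rough floorplan: a 2x2 CPU cluster in one corner, a DDR controller in the other.
spec = SoCSpec(cols=4, rows=4)
spec.add_core(IPCore("cpu", 0, 0, width=2, height=2, ports=(("m", "AXI4"),)))
spec.add_core(IPCore("ddr", 3, 3, ports=(("s", "AXI4"),)))
spec.add_flow("cpu", "ddr", bandwidth_gbps=12.8)
```

Keeping position and size in the specification is what lets later steps reason about physical distance rather than connectivity alone.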



## Specify $\rightarrow$ **Optimize** $\rightarrow$ Generate

- Algorithmic optimizations to meet system specifications (Bandwidth, latency, QoS)
  - Links & Channels: optimize bandwidth and QoS
  - Floorplan: Optimize placement of IP blocks
  - Topology: Final NoC can morph into bus, tree, mesh, other
  - Routes: Leverage path diversity between blocks
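Path diversity can be made concrete on a 2D mesh: between two routers that are `dx` columns and `dy` rows apart there are C(dx + dy, dx) distinct minimal routes. The sketch below counts and enumerates them. It assumes a plain mesh with routers addressed as `(x, y)` grid coordinates, and the function names are illustrative rather than part of any NetSpeed tool.

```python
from math import comb


def minimal_route_count(src, dst):
    """Number of shortest routes between two mesh routers."""
    dx = abs(dst[0] - src[0])
    dy = abs(dst[1] - src[1])
    return comb(dx + dy, dx)


def minimal_routes(src, dst):
    """Enumerate every shortest route as a list of (x, y) hops."""
    if src == dst:
        return [[src]]
    x, y = src
    routes = []
    if x != dst[0]:  # take one step along X toward the destination
        step = 1 if dst[0] > x else -1
        routes += [[src] + rest for rest in minimal_routes((x + step, y), dst)]
    if y != dst[1]:  # take one step along Y toward the destination
        step = 1 if dst[1] > y else -1
        routes += [[src] + rest for rest in minimal_routes((x, y + step), dst)]
    return routes


routes = minimal_routes((0, 0), (2, 1))
print(minimal_route_count((0, 0), (2, 1)), len(routes))  # prints "3 3"
```

An optimizer can spread flows across these alternatives to balance link load, at the cost of having to check that the chosen route set stays deadlock free.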







## Specify $\rightarrow$ Optimize $\rightarrow$ **Generate**

- Simulator to characterize performance of NoC
  - Performance details
    - Visual display of traffic congestion
    - Performance probes
    - Detailed statistics
- Synthesizable RTL
- Functional C++ model
- Verification test bench

[Figure: NocStudio display of NoC layer 0 on the reference grid, reporting link cost = 13032 and buffer cost = 81954, alongside a list of traffic flows between IP ports (for example, uart/s, 10H -> cpu1/m, 32H).]







## AUTOMATED DESIGN AND TOPOLOGY



#### WHICH TOPOLOGY

- A variety of standard topologies: Ring, Bus, Fat tree, Mesh
- What is the most optimal topology for a SoC family?
- Do we always have to use a standard textbook NoC topology?

#### IS THIS TOPOLOGY EFFICIENT FOR MY SOC?

- Standard topologies may be inefficient for a SoC
- Would a topology satisfy the bandwidth requirements?
- Would it provide optimal latency?
- Machine learning algorithms automatically determine the optimal topology
  - Adaptive topology for any given connectivity
  - Fully heterogeneous in channel/buffer sizes
  - SoC floorplan aware
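One simple way to see why a textbook topology may or may not fit a given SoC is to compare traffic-weighted hop counts, a rough proxy for latency. The sketch below does this for a 16-node ring and a 4x4 mesh. It is an illustrative metric only, not the machine learning approach mentioned above, and the node numbering, flow list and function names are assumptions made for the example.

```python
def ring_hops(a, b, n):
    """Shortest distance between nodes a and b on a bidirectional ring of n nodes."""
    d = abs(a - b) % n
    return min(d, n - d)


def mesh_hops(a, b, cols):
    """Manhattan distance between nodes a and b, numbered row-major on a mesh."""
    ax, ay = a % cols, a // cols
    bx, by = b % cols, b // cols
    return abs(ax - bx) + abs(ay - by)


def weighted_avg_hops(flows, hop_fn):
    """Average hop count over (src, dst, bandwidth) flows, weighted by bandwidth."""
    total_bw = sum(bw for _, _, bw in flows)
    return sum(hop_fn(s, d) * bw for s, d, bw in flows) / total_bw


# Hypothetical traffic: (source node, destination node, relative bandwidth)
flows = [(0, 8, 4.0), (0, 1, 1.0), (5, 10, 2.0)]

ring = weighted_avg_hops(flows, lambda s, d: ring_hops(s, d, 16))
mesh = weighted_avg_hops(flows, lambda s, d: mesh_hops(s, d, 4))
print(f"ring: {ring:.2f} hops, mesh: {mesh:.2f} hops")  # ring: 6.14 hops, mesh: 1.86 hops
```

With a different flow list, for example traffic only between ring neighbours, the comparison can tip the other way, which is the argument for deriving the topology from the traffic rather than fixing it up front.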









## PHYSICALLY AWARE, EFFICIENT, CORRECT BY CONSTRUCTION





1. What is the optimal floorplan?
2. Given a floorplan, design a NoC that is correct by construction

- Use heuristic algorithms such as simulated annealing
- Use machine learning and graph theory for efficient design
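As an illustration of the simulated annealing heuristic named above, the sketch below places blocks on grid slots so that heavily communicating blocks end up close together. The cost function (bandwidth times Manhattan distance), the geometric cooling schedule, the parameter defaults and all names are assumptions chosen for the example. The slides do not describe the cost model ORION actually uses.

```python
import math
import random


def wire_cost(placement, flows):
    """Sum of bandwidth x Manhattan distance over all (src, dst, bw) flows."""
    return sum(bw * (abs(placement[s][0] - placement[d][0])
                     + abs(placement[s][1] - placement[d][1]))
               for s, d, bw in flows)


def anneal_floorplan(blocks, flows, cols, rows,
                     steps=5000, t_start=5.0, t_end=0.01, seed=0):
    """Return (placement, cost) found by swapping slot occupants at random."""
    rng = random.Random(seed)
    slots = [(x, y) for y in range(rows) for x in range(cols)]
    if len(blocks) > len(slots) or len(slots) < 2:
        raise ValueError("grid too small for the given blocks")
    rng.shuffle(slots)
    # occupants[i] is the block sitting in slots[i]; None marks an empty slot.
    occupants = list(blocks) + [None] * (len(slots) - len(blocks))

    def placement_of(occ):
        return {b: slots[i] for i, b in enumerate(occ) if b is not None}

    current = wire_cost(placement_of(occupants), flows)
    best, best_cost = placement_of(occupants), current
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        i, j = rng.sample(range(len(slots)), 2)
        occupants[i], occupants[j] = occupants[j], occupants[i]
        candidate = wire_cost(placement_of(occupants), flows)
        delta = candidate - current
        # Always accept improvements; accept regressions with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            current = candidate
            if current < best_cost:
                best, best_cost = placement_of(occupants), current
        else:
            occupants[i], occupants[j] = occupants[j], occupants[i]  # undo the swap
    return best, best_cost


blocks = ["cpu", "gpu", "ddr", "usb"]
flows = [("cpu", "ddr", 10.0), ("gpu", "ddr", 8.0), ("cpu", "usb", 1.0)]
placement, cost = anneal_floorplan(blocks, flows, cols=3, rows=3)
print(placement, cost)
```

For this toy input no placement can cost less than 19, since every flow spans at least one hop. The annealer usually lands on or near that bound, but as a heuristic it carries no optimality guarantee.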

## AUTOMATED AND EFFICIENT SCALING

- In addition to flexibility, heterogeneity and correctness, the design scales efficiently
  - Scale with a single NoC layer
  - Scale with multiple NoC layers





## QUANTIFYING EFFICIENCY



### BASELINE NOC



## POWER AND AREA OPTIMIZATIONS



#### **SYNTHESIS RESULTS**

| Router configuration        | Area (µm²) | Freq. (GHz) | Leakage power (mW) |
|-----------------------------|------------|-------------|--------------------|
| 32-bit data, 4 VC, 5 ports  | 59952      | 1.984       | 0.126              |
| 64-bit data, 2 VC, 5 ports  | 41201      | 1.984       | 0.1718             |
| 64-bit data, 4 VC, 5 ports  | 90252      | 1.888       | 0.183              |
| 128-bit data, 4 VC, 5 ports | 142576     | 1.872       | 0.3                |
| 256-bit data, 2 VC, 5 ports | 120075     | 1.984       | 0.54               |
| 256-bit data, 4 VC, 5 ports | 162890     | 1.728       | 0.5647             |

## ALGORITHMIC: DEADLOCK AVOIDANCE



#### Network-Level Deadlock



- Challenges:
  - Irregular topology
  - Complex system traffic
    - Inter-dependent messages
    - Multiple protocols in an SoC
    - Complex ordering requirements
- Solution:
  - NetSpeed ORION uses graph theory algorithms and formal techniques
    - Correct by construction
    - NoC design remains efficient
    - Robust (can handle complex irregular topologies, routing, etc.)
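A well-known graph-theoretic check for network-level deadlock is to build the channel dependency graph and test it for cycles: for deterministic routing, an acyclic graph is sufficient for deadlock freedom. The sketch below shows that classic check on routes given as ordered lists of channel names. It is offered as background only. The slides do not say which algorithms ORION applies, and the channel names and function names here are hypothetical.

```python
def channel_dependencies(routes):
    """Map each channel to the set of channels a packet may request while holding it."""
    deps = {}
    for route in routes:
        for channel in route:
            deps.setdefault(channel, set())
        for held, wanted in zip(route, route[1:]):
            deps[held].add(wanted)
    return deps


def has_cycle(deps):
    """Iterative depth-first search; a back edge to an in-progress node means a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {node: WHITE for node in deps}
    for start in deps:
        if color[start] != WHITE:
            continue
        color[start] = GREY
        stack = [(start, iter(deps[start]))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if color[nxt] == GREY:
                    return True
                if color[nxt] == WHITE:
                    color[nxt] = GREY
                    stack.append((nxt, iter(deps[nxt])))
                    break
            else:  # all successors explored without finding a cycle
                color[node] = BLACK
                stack.pop()
    return False


# Four channels around a unidirectional ring, each route spanning two channels.
ring_routes = [["c0", "c1"], ["c1", "c2"], ["c2", "c3"], ["c3", "c0"]]
print(has_cycle(channel_dependencies(ring_routes)))  # True: potential deadlock

# Moving the wrap-around hop onto a second virtual channel breaks the cycle.
fixed_routes = [["c0", "c1"], ["c1", "c2"], ["c2", "c3"], ["c3", "c0_vc1"]]
print(has_cycle(channel_dependencies(fixed_routes)))  # False
```

Protocol-level deadlock, where inter-dependent message classes wait on each other at the endpoints, needs additional dependencies between message types and is not covered by this sketch.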





## ALGORITHMIC: QOS



#### **ALGORITHMIC**

Ideal QoS scheme characteristics:

- Fairness (Strict & weighted)
- Dynamic (Adjusts allocation among active agents)
- Work-conserving (High agents utilization)
- Low-cost (HW cost of implementation)

| Existing QoS schemes       | Limitation                |
|----------------------------|---------------------------|
| Limit outstanding messages | Wastes resource bandwidth |
| Rate limit sources         | Wastes resource bandwidth |
| Weight-based arbitration   | Unfair in deep networks   |
| Token passing              | Unfair and slow response  |
| Age-based arbitration      | Unfair in deep networks   |

**NetSpeed Solution:** Distributed QoS algorithms for robust, end-to-end QoS with a low cost implementation. Sophisticated, Efficient and Fair.

#### **AUTOMATED**

**Traditional Flow**

- SoC Requirements
- Architect
  - QoS scheme
  - VC design
- Design
  - Configure VC
  - Configure routing, Arbitration
- Implement
  - Choose HW lib
  - Place, connect
- Validate

**NetSpeed Flow**

- SoC Requirements
- NocStudio Automated Flow
  - ✓ Fully automated flow
  - ✓ Correct by construction
  - ✓ Advanced algorithms to achieve QoS specs
  - ✓ Optimal allocation of NoC channels
- Validate intent


## **QOS: NOC RTL SIMULATION RESULTS**



#### **DIFFERENT WEIGHTS**



| Agents                   | 1 | 2    | 3 | 4   | 5 | 6    | 7    | 8  |
|--------------------------|---|------|---|-----|---|------|------|----|
| % BW Target              | 5 | 25   | 5 | 5   | 5 | 25   | 10   | 20 |
| % BW Target (Normalized) | - | 38.5 | - | 7.7 | - | 38.5 | 15.3 | -  |
| % BW Actual              |   | 39   |   | 7   |   | 38   | 15   |    |
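The normalized row is consistent with rescaling the configured weights over only the agents that are sending traffic, here taken to be agents 2, 4, 6 and 7 (an inference from the "-" entries, not something the slide states). A minimal sketch of that arithmetic, with a hypothetical function name:

```python
def normalized_targets(targets, active):
    """Rescale bandwidth targets so the active agents' shares sum to 100%."""
    total = sum(targets[a] for a in active)
    return {a: 100.0 * targets[a] / total for a in active}


targets = {1: 5, 2: 25, 3: 5, 4: 5, 5: 5, 6: 25, 7: 10, 8: 20}
active = [2, 4, 6, 7]  # assumed active set, read off the table above

shares = normalized_targets(targets, active)
for agent in active:
    print(agent, round(shares[agent], 2))
```

The computed shares are about 38.46, 7.69, 38.46 and 15.38 percent, which match the table's 38.5, 7.7, 38.5 and 15.3 to within rounding. This is the work-conserving behaviour listed among the ideal QoS characteristics: bandwidth left unused by idle agents is redistributed in proportion to the remaining weights.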

### **DIFFERENT WEIGHTS (INACTIVE AGENTS)**



## VERIFICATION: CHALLENGES & SOLUTION



- The challenge: How to verify given such flexibility?
- Solution: 3-tier approach to verification



- Highly parameterized design
- Random and directed tests
- Scoreboards for verifying QoS, arbitration & routing





- Two parallel robust benches
- Thousands of random & directed NoC configurations
- Verification flow configurable through NocStudio
- Auto-generated expect & predict UVM testbench
- Full randomization of stimulus & flow control
- Global debug monitor tracks every packet through the NoC



- Successfully emulated multiple NoC configurations
- Full randomization of stimulus with end-to-end self-checking tests
- 100 million packets through the NoC every night



## **NETSPEED ORION VS. COMPETITION**







Quantum Leap in SoC Interconnect Design

Automated

Efficient

Algorithmic

Coherent\*

- Optimized Design
- Faster Time-to-market
- Higher Performance
- Lower Power

... Creates New Design Possibilities

#### \* Fully coherent NoC coming in Q3 2013